Expectation-maximization algorithms for learning a finite mixture of univariate survival time distributions from partially specified class values
Authors
Abstract
A data set is heterogeneous when it merges samples from different classes. A finite mixture model can represent the survival time distribution of a heterogeneous patient group through the proportion of each class together with the survival time distribution within each class. A heterogeneous data set cannot be explicitly decomposed into homogeneous subgroups unless every sample is precisely labeled with its class of origin, and this obstacle must be overcome to estimate a finite mixture model. The expectation-maximization (EM) algorithm has been used to obtain maximum likelihood estimates of finite mixture models by softly decomposing heterogeneous samples when labels are missing for a subset or for all of the data. Medical surveillance databases often contain partially labeled data: samples that are not completely unlabeled but carry only imprecise information about their class values. In this study we propose new EM algorithms that take advantage of such partial labels and thus incorporate more information than traditional EM algorithms. Specifically, we propose four variants of the EM algorithm, named EM-OCML, EM-PCML, EM-HCML and EM-CPCML, each of which assumes a specific mechanism of missing class values. We conducted a simulation study on exponential survival trees with five classes and showed that the advantage of incorporating a substantial amount of partially labeled data can be highly significant. We also showed that model selection based on AIC values works reasonably well for selecting the best of the proposed algorithms on each specific data set. A case study on a real-world gastric cancer data set provided by the Surveillance, Epidemiology and End Results (SEER) program showed that EM-CPCML outperformed not only the other proposed EM algorithms but also conventional supervised, unsupervised and semi-supervised learning algorithms.
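The general idea of using partial labels inside EM can be illustrated with a minimal sketch. It is not any of the four proposed variants (EM-OCML, EM-PCML, EM-HCML, EM-CPCML), whose details the abstract does not give. It assumes uncensored exponential survival times, and it assumes that a partial label only restricts a sample to a set of admissible classes without carrying further information about the missing-label mechanism. The function name `em_exponential_mixture` and the boolean mask `allowed` are hypothetical names introduced for illustration.

```python
import numpy as np

def em_exponential_mixture(t, allowed, n_iter=200, tol=1e-8):
    """EM sketch for a K-class exponential mixture with partial labels.

    t       : (n,) positive survival times (assumed uncensored).
    allowed : (n, K) boolean mask (hypothetical encoding of partial labels).
              allowed[i, k] is True if sample i may belong to class k:
              an all-True row is unlabeled, a one-hot row is fully labeled.
              Assumes every class is admissible for at least one sample.
    """
    t = np.asarray(t, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    n, K = allowed.shape
    pi = np.full(K, 1.0 / K)                            # class proportions
    rate = (1.0 / t.mean()) * np.linspace(0.5, 2.0, K)  # initial hazard rates
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class weights, zeroed outside the admissible set.
        dens = pi * rate * np.exp(-np.outer(t, rate))   # shape (n, K)
        dens = np.where(allowed, dens, 0.0)
        total = dens.sum(axis=1, keepdims=True)
        resp = dens / total
        # M-step: weighted maximum likelihood for proportions and rates.
        nk = resp.sum(axis=0)
        pi = nk / n
        rate = nk / (resp * t[:, None]).sum(axis=0)
        # Log-likelihood under the parameters used in this E-step.
        ll = np.log(total).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, rate, resp
```

With every row of `allowed` set to all-True this reduces to ordinary unsupervised EM, and with one-hot rows it reduces to supervised per-class estimation; rows in between correspond to the partially labeled case the abstract describes.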
Publication date: 2015